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GPU-Driven Rendering? 


* GPU controls what objects are actually 
rendered 
* "draw scene’ GPU-command 
— n viewports/frustums 
— GPU determines (sub-)object visibility 
— No CPU/GPU roundtrip 
e Prior work [SBOTOS] 
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Motivation (RedLynx) 
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Modular construction using in-game level editor 

High draw distance. Background built from small objects. 
No baked lighting. Lots of draw calls from shadow maps. 
CPU used for physics simulation and visual scripting 
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Motivation 
Assassin's or Unity. 
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Motivation 
Assassin's Creed Unity 


e Modular construction (partially 
automated) 

• ~10x instances compared to 
previous Assassin's Creed 
games 


• CPU scarcest resource on 
consoles 
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Mesh Cluster Rendering 


Fixed topology (64 vertex strip) 


Split & rearrange all meshes to fit fixed 
topology (insert degenerate triangles) 


Fetch vertices manually in VS from 
shared buffer [Riccio13] 


Drawlnstancedindirect 
GPU culling outputs cluster list 
& drawcall args 
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Mesh Cluster Rendering 


Arbitrary number of meshes in 
single drawcall 


GPU-culled by cluster bounds 
[Greene93] [Shopf08] [Hill11] 


Faster vertex fetch 
Cluster depth sorting 
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Mesh Cluster Rendering (ACU) 


Problems with triangle strips: 

— Memory increase due to degenerate triangles 
— Non-deterministic cluster order 
MultiDrawlndexedlnstancedindirect: 

— One (sub-)drawcall per instance 

— 64 triangles per cluster 

— Requires appending index buffer on the fly 
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Rendering Pipeline Overview 
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Rendering pipeline overview 


CPU quad tree culling 

Per instance data: 

— E.g. transform, LOD factor... 
— Updated in GPU ring buffer 
— Persistent for static instances 
Drawcall hash build on non-instanced data: 
— E.g. material, renderstate, ... 

Drawcalls merged based on hash 
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Rendering Pipeline Overview 


Rendering Pipeline Overview 
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Rendering Pipeline Overview 


Rendering Pl 


peline Overview 


Rendering Pipeline Overview 


INSTANCE CULLING (FRUSTUM/OCCLUSION) = 


<< Frame, 32.44 ms 


CLUSTER CHUNK EXPANSION 


CLUSTER CULLING 
(FRUSTUM/OCCLUSION/TRIANGLE BACKFACE ) 


! 
| 
INDEX BUFFER COMPACTION 4 4 4 4 


Instance0 Instance! Instance2 


—— as 
Compacted 1паех butter 


MULTI -DRAW 
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Rendering Pipeline Overview 


INSTANCE CULLING (FRUSTUM/OCCLUSION) 


CLUSTER CHUNK EXPANSION 


CLUSTER CULLING 
(FRUSTUM/OCCLUSION/TRIANGLE BACKFACE) 


INDEX BUFFER COMPACTION 


MULTI -DRAW 
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Rendering Pipeline Overview 
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static Triangle Backface Culling 


e Bake triangle visibility for 
pixel frustums of cluster 
Р M centered cubemap 
ў = e Cubemap lookup based 
on camera 
Ф Fetch 64 bits for visibility 
of all triangles in cluster 
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Static Triangle ВасКїасе Culling 


e Only one pixel per cubemap face (6 bits 
per triangle) 


* Pixel frustum is cut at distance to increase 
culling efficiency (possible false positives 
at oblique angles) 


e 10-3096 triangles culled 
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Occlusion Depth Generation 


Depth pre-pass with best occluders 


Rendered in full resolution for High- 
Z and Early-Z 


Downsampled to 512x256 


Combined with reprojection of last 
frame's depth 


Depth hierarchy for GPU culling 
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Occlusion Depth Generation 


300 best occluders (-600us) 


Rendered in full resolution for High 
Z and Early-Z 


Downsampled to 512x256 (100us) 
Combined with reprojection of last 
frame's depth (50us) 


Depth hierarchy for GPU culling 
(50us) 


(*Р54 performance ) 
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Shadow Occlusion Depth Generation 


For each cascade 

Camera depth reprojection (-70us) 
Combine with shadow depth 
reprojection (10us) 


Depth hierarchy for GPU culling 
(30и) 
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Camera Depth R 
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Camera Depth R 
1 рор tion 


Camera Depth R 
ү рор tion 


Camera Depth Reprojection 


Light Space Reprojection 
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Camera Depth Reprojection 


Reprojection “shadow” of the building 
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Сатега Depth Reprojection 


Similar to [Silvennoinen1 2] 
But, mask not effective because of fog: 
— Cannot use min-depth 
— Cannot exclude far-plane | | n 
64x64 pixel reprojection = 7 
Hd pre-process 202-0 ted remove е redundant 


ць, 


= —: - = > йы, 
= - =>» = = = "ш — 
ЕР - Б = т = чеха 
c = = = = > 


e 799 


SIGGRAPH NS Advances in Real-Time Rendering course ^ | = бы. 


CPU: IN 
* 1-2 Orders of парни less Ж, 
ТІКЕ раша 5 AC wi x Ode ` 
* 20-40% triangles culled (ради басе + си | 


се + cluste 
и 3 
* Only small overall gain: «109701 geomet 
* 30-80% shadow triangles cu lled EA: 
"2% 
Work іп progress: 
* More GPU-driven for static Ê 5 


* Моге batch friendly data — 
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RedLynx Topics 


Virtual Texturing in GPU-Driven Rendering 
Virtual Deferred Texturing 

MSAA Trick 

Two-Phase Occlusion Culling 

Virtual Shadow Mapping 
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Virtual Texturing 


Key idea: Keep only the visible 
texture data in memory [Hall99] 


Virtual 256k? texel atlas 
128? texel pages 


8k2 texture page cache 


— 5 slice texture array: Albedo, 
specular, roughness, normal, etc. 


— DXT compressed (BC5 / ВОЗ) 
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Virtual texturing is the biggest 
difference between our and AC: 
Unity's renderer 


Key feature: All texture data is 
available at once, using just a 
single texture binding 


No need to batch by textures! 
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Single Draw Са! bd 


Viewport = single draw call (x2) 
Dynamic branching for different 
vertex animation types 

— Fast on modern GPUs (+2% cost) 


Cluster depth sorting provides 
gain similar to depth prepass 


Cheap OIT with inverse sort 
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e Complex material blends and 
decal rendering results are 
stored to VI page cache 


e Data reuse amortizes costs 
over hundreds of frames 

* Constant memory footprint, 
regardless of texture resolution 
and the number of assets 
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Virtual Deferred Texturing 


* Old Idea: Store UVs to the G- 
buffer instead of texels [Auf.07] 

* Key feature: VI page cache 
atlas contains all the currently 
visible texture data 

e 16+16 bit UV to the 8k? texture 
atlas gives us 8 x 8 subpixel 
filtering precision 
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Gradients апа Tangent Frame 


Calculate pixel gradients in 
screen space. UV distance 
used to detect neighbors. 


No neighbors found > bilinear 


Tangent frame stored as a 32 
bit quaternion [Frykholm09] 


Implicit mip and material id 
from VT. Page = UV. xy 7 128. 
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Recap 4 Advantages 


64 bits. Full fill rate. No MRT. 


Overdraw Is dirt cheap 
— Texturing deferred to lighting CS 


Quad efficiency less important 


Virtual texturing page ID pass 
is no longer needed 
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MSAA Trick 


* Key Observation: UV and 
tangent can be interpolated 

e Idea: Render the scene at 2x2 
lower resolution (540p) with 
ordered grid 4xMSAA pattern 

e Use Texture2DMS.Load() to 
read each sample separately in ЖЕУ +308 
the lighting compute shader һ=р+106+108 
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1080р Reconstruction 


Reconstruct 1080р into LDS 


Edge pixels are perfectly 
reconstructed. MSAA runs the 
pixel shader for both sides. 
Interpolate the inner pixels’ UV 
and tangent 

Quality is excellent. Differences  »=c+zc4+700 
are hard to spot. P, =D +2DC+>DB 
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8xMSAA Trick Benchmark 


e 128 bpp G-Buffer 
e One pixel is a 2x2 tile of "2xMSAA pixels” 


e Xbox One: 1080p + MSAA + 60 fps © 


G-buffer rendering time 
Pixel shader waves 


DRAM memory traffic 
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Two-Phase Occlusion Culling 


No extra occlusion pass with low ENDE, 
poly proxy geometry Ge ees ыт 
Precise WYSIWYG occlusion 
Based on depth buffer data 


Depth pyramid generated from 
HTILE min/max buffer 


O(1) occlusion test (gather4) 
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Two-Phase Occlusion Culling 


1°! phase 


— Cull objects & clusters using 
last frame's depth pyramid 


— Render visible objects 


I 
I 
Object list / 


— Refresh depth pyramid м > қ | 
ер so 

- Test culled objects & clusters oa 

— Render false negatives та | 
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Depth sort 
clusters 


Benchmark 


* “Torture” unit test scene 
— 250,000 separate moving objects 
— 1 GB of mesh data (10k+ meshes) 
— 8k? texture cache atlas 
* DirectX 11 code path 
— 64 vertex clusters (strips) 
— No Executelndirect / MultiDrawlndirect 


e Only мо DrawInstancedIndirect calls 
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Benchmark Results 
Xbox One, 1080p 


CS A | тен” 


Object culling + LOD 0.28 ms 0.26 ms 
Cluster culling 0.09 ms 0.04 ms 


Draw (G-buffer) 1.60 ms < 0.01 ms 


Pyramid generation 0.06 ms 


Total 2.3 ms 


CPU time: 0.2 milliseconds (single Jaguar CPU core) 
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Virtual Shadow Au 


128k? virtual shadow map 
2562 texel pages 


Identify needed shadow pages 
from the z-buffer [Fernando01 |. 
Cull shadow pages with the 
GPU-driven pipeline. 

Render all pages at once. 
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VTSM Quality and Performance 


Close to 1:1 shadow-to-screen 
resolution in all areas 
Measured: Up to 3.5x faster 
than SDSM [Lauritzen1 0] in 
complex “sparse” scenes 

e Virtual SM slightly slower than 
SDSM 8 CSM in simple scenes 
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GPU-Driven Rendering + DX12 


NEW DX12 (PC) FEATURES FEATURES IN OTHER APIs 


Executelndirect Custom MSAA patterns 
Asynchronous Compute GPU side dispatch 

VS RT index (GS bypass) SIMD lane swizzles 

Resource management Ordered atomics 

Explicit multiadapter SV_Barycentric to PS 

Tiled resources + bindless Exposed CSAA/EQAA samples 
Сопвегуац абісг + ROV Shading language with templates 


му уху 
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